-
- ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
- ^ ^
- ^ DOCUMENTATION FOR ^
- ^ ^
- ^ SPELLF: Tiny Trainable Spellfilter Ver. 1.01 ^
- ^ ^
- ^ (c) 1989 Kas & Rita Thomas. All rights reserved. ^
- ^ ^
- ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
-
- You may distribute SPELLF to your friends. But if you do so, please
- register your copy with us today, by sending $10 (in cash, check,
- M.O., stamps, whatever) to: Kas & Rita Thomas, 578 Fairfield Ave.,
- Stamford, CT 06902.
-
- This program was created using Borland Turbo C (Ver. 2.0). If you
- would like fully commented source code for SPELLF, kindly send $25 in
- lieu of the registration fee. In other words: For $25, you get source
- code AND registration, all in one shot (on disk).
-
- When you write, please tell us about your computer setup. Are you
- using SPELLF on a laptop? Which model? (Are you happy with it?) We'd
- be delighted to hear from you. Write to us today!
-
-
-
-
-
- WHAT IS SPELLF? AND: WHAT ISN'T IT?
-
- SPELLF is a spellfilter. We feel this word better describes the spirit
- of this utility than the term "spellchecker," which implies a literal
- brute-force comparison of words against a dictionary. SPELLF does not
- do this, obviously. (The "dictionary" file used by SPELLF is only 8K
- in size!) Rather than check literal spellings, SPELLF checks the
- "legality" of substrings within each word and reports any words
- that contain illegal letter combinations. In this way, SPELLF acts
- more like a logical filter than a mere word-looker-upper. You give a
- text file to SPELLF, and the program acts as a sieve, reading the text
- file in and passing suspect words to "stdout" (the standard output
- device for your computer; usually the console). The output is fully
- redirectable, which means you can have it show up on your printer (or
- on disk) rather than on your screen.
-
- SPELLF carries out its filtering at lightning speed: On the slowest 8-
- MHz XT clone we could find, SPELLF took only 10 seconds to spellcheck
- a 120K text file! (It adds words to the main dictionary at the
- blistering rate of 1,000 English words -- about 6 Kbytes -- per
- second.) On a 286 or 386 machine, you may see throughput rates five to
- ten times higher, especially if you use redirection. (The biggest
- runtime bottleneck for SPELLF is waiting for the screen to update.)
-
- Speed isn't SPELLF's primary virtue, however. Small size (both in
- terms of disk image and RAM overhead) is where SPELLF really shines.
- The program itself is just 12K, making it perfect for use on floppy-
- disk-only systems (laptops, for example) and systems that can't
- accommodate the 200K spellcheckers that come with such programs as ...
- well, you know the big names. At runtime, SPELLF uses approximately
- 76K of RAM (12K of core RAM and 64K of "far core" heap space). For
- its dictionary, SPELLF uses a disk file only 8 kilobytes in size. No
- matter how many words you stuff into this dictionary file ($.DIC), it
- stays at exactly 8K.
-
- Another virtue of SPELLF is that it operates in command-line mode,
- making it easy to invoke spellchecking from batch files.
-
- Finally, SPELLF's output (a list of suspect words, with their line
- locations in the original file) is redirectable. That is, you can make
- the suspect-word list show up at your printer, or in a new file on
- disk, rather than on the screen. (Again, this approach lends itself to
- batch file applications.)
-
-
-
-
-
- DISCLAIMERS
-
- Whether or not you send us money, no warranties are made with respect
- to the program's performance, fitness, hardware compatibility, etc.
-
- To run this program, you need at least 72K of free RAM and 20K of disk
- space at all times. Any version of DOS should be fine.
-
- Note: This program is designed to spellcheck standard ASCII text files
- only. Nothing disastrous will happen if you attempt to spellcheck a
- WordStar file, because SPELLF simply ignores high-bit characters.
- Likewise, control codes that fall outside the normal range of
- alphanumeric ASCII characters pose no problem whatsoever. But if
- your word processor embeds plain-text codes such as /b or /i for font
- calls, you may see erroneous suspect-word callouts on the screen.
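-
- For programmers, the behavior described in the note above can be
- sketched in C roughly as follows. This is a minimal illustration
- under stated assumptions, not SPELLF's actual source: the function
- name next_word and its parameters are hypothetical, and we assume
- that a high-bit byte inside a word is dropped without ending the word.
-
```c
#include <stddef.h>

/* 1 for the plain ASCII letters A-Z and a-z, 0 for everything else. */
static int is_letter(unsigned c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* Copy the next word from *pp into out (at most cap-1 letters, cap >= 1)
   and return 1, or return 0 when the text is exhausted. *line is bumped
   for every newline skipped, so it holds the line on which the word
   starts. Bytes with the high bit set are dropped (assumption). */
int next_word(const char **pp, char *out, size_t cap, unsigned *line)
{
    const unsigned char *p = (const unsigned char *)*pp;
    size_t n = 0;

    /* Skip everything up to the first letter, counting newlines. */
    while (*p != '\0' && !is_letter(*p)) {
        if (*p == '\n')
            ++*line;
        p++;
    }

    /* Collect letters; high-bit bytes vanish without ending the word. */
    while (*p != '\0' && (is_letter(*p) || *p >= 0x80)) {
        if (*p < 0x80 && n + 1 < cap)
            out[n++] = (char)*p;
        p++;
    }

    out[n] = '\0';
    *pp = (const char *)p;
    return n > 0;
}
```
-
- Called in a loop with line starting at 1, this yields each word
- together with its line number. Note that a plain-text code such as
- /b would come out as the one-letter "word" b, which is why such codes
- can trigger false callouts.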
-
- If there is sufficient demand, we will rewrite the program to take
- care of format code detection. (Don't hold your breath, however!)
-
-
-
-
-
- THE $.DIC FILE
-
- Your program should have come bundled with an 8K file called $.DIC.
- SPELLF looks for this file at runtime; it contains the information
- needed to conduct spellchecking (or spellfiltering) operations. This
- file MUST be present in the current working directory for SPELLF to
- work properly.
-
- The $.DIC dictionary file that we created for bundling with this
- program contains data for 20,000 commonly used English words. If you
- did not get a copy of $.DIC, don't panic: SPELLF will let you create
- your own $.DIC file, but you will have to supply your own core word
- list (of properly spelled words) with which to make the file. To
- create a brand-new $.DIC file from scratch, do this:
-
- 1. Be sure you are in a directory that contains no $.DIC file. (If a
- $.DIC file already exists, rename or delete it; otherwise the existing
- file will be UPDATED to contain the new word list.)
-
- 2. Run SPELLF with two arguments as follows:
-
-
- SPELLF {filename of word list} -a [Enter]
-
-
- Don't forget the "-a" on the command line. This tells SPELLF to ADD
- the contents of {filename} to $.DIC.
-
- If $.DIC did not exist before, a new file named $.DIC will show up in
- your current directory when you follow the above procedure. You will
- then be able to "spellcheck" any file against the $.DIC dictionary.
-
- If a $.DIC file already existed in the current directory, the contents
- (vocabulary) of {filename} will simply be ADDED to $.DIC. The previous
- contents of $.DIC will be preserved. It is a true ADD operation.
-
-
-
-
-
-
- CREATING CUSTOM DICTIONARIES
-
- Obviously, you can use the foregoing procedure to create any number of
- custom dictionaries (scientific word lists, lists of foreign words,
- computer terms, etc.) for use with the SPELLF program. All you need to
- do is supply the name of a file containing a list of scientific terms
- (or whatever) on the command line:
-
-
- SPELLF NEWWORDS.TXT -a [Enter]
-
-
- In this case, the vocabulary of NEWWORDS.TXT will be added to $.DIC.
- (Once again: If $.DIC did not already exist in the current working
- directory, it will be created. If it exists, it will be updated.) The
- filename does not have to have a .TXT extension; this is merely an
- example. Any extension is permissible. (Caution: If you accidentally
- supply the name of a spreadsheet or other non-text file, you stand a
- chance of corrupting your existing $.DIC file. Always keep a "good"
- copy of $.DIC on a backup disk somewhere.)
-
- Note that SPELLF is not merely an English-language spellchecker. If
- you happen to have a French word list on disk, you can use it to
- create a French-language version of $.DIC, and hence a French-language
- spellchecker. The copy of $.DIC that comes bundled with SPELLF happens
- to contain 20,000 English words. It could just as easily contain
- French, Spanish, German, or Italian words, etc.
-
-
-
-
-
- SPELLCHECKING
-
- To spellcheck a text file, just type SPELLF at the DOS prompt and
- supply the name of the text file you wish to examine. Then hit Enter.
- Example:
-
-
- SPELLF {filename} [Enter]
-
-
-
- If you type SPELLF DRAFT.DOC [Enter], the file DRAFT.DOC will be
- rigorously spellchecked using the words contained in $.DIC. (Remember
- that SPELLF expects the file $.DIC to be in the current working
- directory.) A complete list of suspect words and their line locations
- will be printed to the screen. If you want, the list of suspect words
- can show up in a separate disk file, or be printed out on your
- printer. This is called redirection.
-
- To redirect the output of SPELLF to a brand-new file called BADWORDS,
- simply type:
-
-
-
- SPELLF DRAFT.DOC > BADWORDS [Enter]
-
-
-
- During program execution, in this case, nothing will happen on the
- screen (but the disk will be active). When the DOS prompt returns, you
- should see that there is a new file, BADWORDS, in your current
- directory. You can look at this file with the DOS command TYPE, or use
- your favorite word processor to open it.
-
- Note that SPELLF operates much faster when redirection to a file is
- used. That's because the process of displaying output on the screen
- (in SPELLF's default mode) is inherently slow -- slower, at least,
- than SPELLF's file I/O operations.
-
- You may wish, occasionally, to redirect output to your printer. Again,
- you can use DOS to do this:
-
-
-
- SPELLF DRAFT.DOC > PRN [Enter]
-
-
- This causes the file DRAFT.DOC to be spellchecked and the suspect
- words to appear at the printer, rather than at the screen.
-
-
-
-
-
-
- TESTING WORDS FOR PRESENCE IN THE DICTIONARY
-
- You can test words for presence or absence in the dictionary.
- The procedure is very simple: Just type SPELLF and -t, like this:
-
-
-
- SPELLF -t [Enter]
-
-
-
- As soon as the program loads, a query message comes up on the screen
- asking you to supply a word. Just type the word, then hit Enter.
- SPELLF will consult $.DIC to see if the word you typed is
- already in the dictionary. If it is, you'll be told so, and you will
- be asked if you want to test another word. (Type 'y' or Enter to
- answer in the affirmative.) If the word you typed was NOT in the
- dictionary, you'll be told so, and you will be given a chance to add
- it. No extra typing is required to add the word: Just hit Enter OR
- type 'y' for yes. (You must type 'n' if you do NOT wish to add the
- word to $.DIC.)
-
- You may keep checking words in this fashion for as long as you like.
- The loop will stop when you answer 'n' (No) to the "Test another
- word?" prompt.
-
- "Test" mode thus offers a second way (besides "Add" mode; see
- "CREATING CUSTOM DICTIONARIES," above) to update the dictionary. You
- can update the $.DIC dictionary one word at a time simply by entering
- the test-word loop as above, quitting at any convenient point.
-
- Obviously, when you have more than a handful of words to add to the
- dictionary, it makes sense to create a text file containing just the
- words in question, and "feed" it to SPELLF in "Add" mode by typing
- SPELLF {filename} -a, then Enter.
-
-
-
-
-
- SPELLCHECKING VS. SPELLFILTERING
-
- Conceptually, there are two ways to attack the problem of
- spellchecking: You can approach it in a brute-force manner, comparing
- every word in a file with entries in a dictionary . . . or you can
- approach it as an "expert system" type of problem, wherein an attempt
- is made to catalog all known rules of spelling (and spellcheck on a
- logical basis rather than a lookup basis). The first approach -- brute
- force lookup of words in an all-encompassing dictionary -- is familiar
- to anyone who has ever used a fully featured word processor. This
- approach is, in theory, infallible, provided the dictionary is
- complete. In the real world, no dictionary is ever large enough to
- approach completeness, nor CAN a dictionary ever be foolproof, due to
- the changing nature of the language. Proper nouns are always a
- problem, and new coinages are always entering the language. Indeed,
- the inherent "inventivity" of English is one of its most endearing
- characteristics. But this very property dooms all orthodox
- spellcheckers to failure.
-
- The second approach -- that of examining the rules governing spelling,
- and constructing a spellchecker based on those rules -- happens to
- coincide with the approach taken by most hyphenation programs. Most
- hyphenators are state machines whose main operations consist of affix
- analysis and word-root parsing. The same approach can be extended to
- spellchecking. Why not simply examine patterns of letters, and ask
- whether the individual patterns are "legal constructs" or "illegal
- constructs" based on known rules of English spelling? (Here, by
- patterns we mean more than just two-letter combinations.) An
- adaptation of this approach is used in SPELLF.
-
- The state-machine approach has many advantages over the brute-force
- method. For one thing, it requires no bulky dictionary (which, in many
- conventional programs, consumes 200K or more of disk space IN
- COMPACTED FORM). For another thing, it means that, in theory anyway,
- English words that don't exist yet can be spellchecked properly,
- because new coinages will always (presumably) follow the rules of
- English-language letter-combining. (That is, it's unlikely anybody
- will coin words that contain unpronounceable combinations like "xbjt"
- or "tqtpp.") A rule-based program should be able to spellcheck new
- coinages that it has not seen before. This is a formidable advantage
- over conventional spellchecking systems.
-
- SPELLF incorporates these ideas, and it does so in ways that are
- simple, effective, and conducive to rapid program execution.
-
- When we say that SPELLF is a spellfilter, we mean simply that the goal
- of the program is to filter out illogical spellings and yield up a
- list of words whose letter-combinations don't make sense from the
- standpoint of known rules of English spelling.
-
-
-
-
-
- HOW THE PROGRAM WORKS (A TECHNICAL ASIDE, FOR PROGRAMMERS ONLY)
-
- If you're wondering how 20,000 or more English words can fit into an
- 8,192-byte dictionary file, consider that in 8,192 bytes there are
- 65,536 bits, each of which can represent a unique hash code. The $.DIC
- file that we've been discussing so far is not really a dictionary file
- in the true sense of the word, but a hash table, each one-bit entry of
- which represents a 16-bit hash code derived from a four-byte substring
- of whatever word is currently being examined. Words of fewer than five
- letters correspond to single entries in the hash table; words
- with five or more letters are carried in the table as multiple hashes.
- A five-letter word has two hash entries; a six-letter word has three
- hash entries; etc.
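-
- The table layout just described can be sketched in a few lines of C.
- This is an illustration only, not SPELLF's actual source: the names
- table_set, table_test and entries_for_length are hypothetical, and
- the order of bits within each byte is an assumption.
-
```c
#define TABLE_BYTES 8192u   /* 8,192 bytes x 8 = 65,536 one-bit entries */

static unsigned char table[TABLE_BYTES];   /* starts out all zero */

/* Mark the entry for a 16-bit hash code as "legal". */
void table_set(unsigned hash)
{
    hash &= 0xFFFFu;
    table[hash >> 3] |= (unsigned char)(1u << (hash & 7u));
}

/* Return 1 if the entry for this hash code is set, else 0. */
int table_test(unsigned hash)
{
    hash &= 0xFFFFu;
    return (table[hash >> 3] >> (hash & 7u)) & 1;
}

/* Number of table entries a word of the given length contributes:
   one for words of fewer than five letters, otherwise one for each
   four-letter substring (length minus three). */
unsigned entries_for_length(unsigned len)
{
    if (len == 0)
        return 0;
    return (len < 5) ? 1u : len - 3u;
}
```
-
- The upper 13 bits of the hash select one of the 8,192 bytes and the
- lower 3 bits select a bit within that byte, which is why the file
- cannot grow no matter how many words are added.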
-
- During runtime, every word in a file is "factored out" into its
- constituent hash codes (however many there are), and each
- corresponding hash position in the $.DIC file is checked. If the hash
- check reveals a "legal" combination of four letters, the check
- succeeds; if the hash is illegal (representing a letter combination
- not found in the table), it fails and the word is flagged as
- "suspect."
-
- A 16-letter word such as "incomprehensible" contains 13 four-letter
- substrings and has 13 corresponding hash codes, each of which must be
- checked against entries in the $.DIC table. If even one check fails,
- the word is flagged as suspect, and printed to "stdout." The word
- "incomprehensible" must pass all 13 checks in order to be presumed
- correctly spelled.
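-
- A minimal C sketch of this all-or-nothing check follows. It is not
- SPELLF's actual source. In particular, placeholder_hash is a stand-in
- we made up for illustration (four bits per letter, taken straight
- from the letter's position in the alphabet), and the names add_word
- and word_passes are hypothetical. Words are assumed to contain
- letters only.
-
```c
#include <ctype.h>
#include <string.h>

static unsigned char dic[8192];   /* 65,536 one-bit entries, all clear */

/* Stand-in hash (an assumption, NOT SPELLF's real one): four bits per
   letter packed into 16 bits. Positions past the end of a short word
   count as zero. */
static unsigned placeholder_hash(const char *s, size_t n)
{
    unsigned h = 0;
    size_t i;

    for (i = 0; i < 4; i++) {
        unsigned code = 0;
        if (i < n)
            code = (unsigned)(tolower((unsigned char)s[i]) - 'a') & 15u;
        h = (h << 4) | code;
    }
    return h & 0xFFFFu;
}

/* Visit every four-letter substring of word. When train is nonzero,
   set its bit; otherwise test it and give up at the first miss.
   Returns 1 if every substring was present (always 1 when training). */
static int walk(const char *word, int train)
{
    size_t len = strlen(word);
    size_t windows = (len < 5) ? 1 : len - 3;
    size_t i;

    for (i = 0; i < windows; i++) {
        size_t n = (len - i < 4) ? len - i : 4;
        unsigned h = placeholder_hash(word + i, n);
        unsigned char mask = (unsigned char)(1u << (h & 7u));

        if (train)
            dic[h >> 3] |= mask;
        else if ((dic[h >> 3] & mask) == 0)
            return 0;   /* one failed check makes the word suspect */
    }
    return 1;
}

void add_word(const char *word)    { walk(word, 1); }
int  word_passes(const char *word) { return walk(word, 0); }
```
-
- After add_word("incomprehensible"), the misspelling
- "incomprehensable" fails because the substring "ensa" was never
- seen. The made-up word "comprehens" passes, since all of its
- four-letter substrings were seen. That is the filter at work: it
- judges letter combinations, not whole words.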
-
- Notice that all checks are done in RAM; the disk is not accessed 13
- times when "incomprehensible" is checked against $.DIC, because $.DIC
- is captured to RAM at runtime.
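-
- Capturing the table to RAM amounts to a single 8,192-byte read. The
- C sketch below shows one way to do it, together with a matching
- write-back of the kind "Add" mode would need. The function names
- dic_load and dic_save, their return codes, and the treatment of a
- missing file as an empty table are our assumptions, not details
- taken from SPELLF's source.
-
```c
#include <stdio.h>
#include <string.h>

#define DIC_BYTES 8192

/* Read the table from path into buf (DIC_BYTES long).
   Returns 1 if loaded, 0 if the file could not be opened (buf is left
   zeroed, i.e. an empty dictionary), -1 if the file is too short. */
int dic_load(const char *path, unsigned char *buf)
{
    FILE *fp = fopen(path, "rb");
    size_t got;

    memset(buf, 0, DIC_BYTES);
    if (fp == NULL)
        return 0;
    got = fread(buf, 1, DIC_BYTES, fp);
    fclose(fp);
    return (got == DIC_BYTES) ? 1 : -1;
}

/* Write the whole table back in one piece, so the file on disk is
   always exactly DIC_BYTES long. Returns 1 on success, 0 on failure. */
int dic_save(const char *path, const unsigned char *buf)
{
    FILE *fp = fopen(path, "wb");
    size_t put;

    if (fp == NULL)
        return 0;
    put = fwrite(buf, 1, DIC_BYTES, fp);
    return (fclose(fp) == 0 && put == DIC_BYTES) ? 1 : 0;
}
```
-
- A typical run would call dic_load("$.DIC", buf) once at startup and
- do every subsequent check against buf in memory.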
-
- The concept of hashing spellcheckers is not new; McIlroy created one
- for the PDP-11 (IEEE Trans. Comm. COM-30, Jan. 1982, pp. 91-99), and
- another attempt at such a program is discussed in CRAFTING TURBO C
- SOFTWARE COMPONENTS & UTILITIES by Richard S. Wiener (1988, Wiley &
- Sons). What is new about SPELLF is the small size of the hash table,
- and the concept of "spellfiltering" as opposed to conventional
- spellchecking from within a text document. No one talks about these
- programs as filters. But that's essentially what they are. You run
- SPELLF on a file in order to filter out bad words -- "suspect" words.
- The spelling is never actually "checked" in the literal sense.
-
- One can argue as to whether a 16-bit hash is adequate for a four-
- letter substring. Five bits suffice to encode the 26 letters of the
- alphabet unambiguously (2^5 = 32), but four letters at five bits
- apiece would need 20 bits. Four bits per letter fits a 16-bit hash
- exactly, yet four bits can distinguish only 16 letters (2^4 = 16).
- Mapping 26 letters into four bits isn't as dangerous as it sounds,
- however, because: (1) the 16
- most-frequently-used letters of the alphabet are used with very high
- frequency indeed, and (2) the remaining 10 letters can be assigned
- hash positions that tend to map over known-good portions of the hash
- table. Accordingly, in SPELLF, the ASCII table is renumbered to
- reflect frequency of usage and best-fit mapping, so that hash
- collisions are seldom "fatal" in the sense of allowing a wrong
- spelling to score as "right."
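-
- To make the renumbering idea concrete, here is a small C sketch. It
- is purely illustrative: the frequency ordering in RANK is a commonly
- quoted one, the wrap-around pairing of rare letters with common ones
- is a naive choice of ours, and the names letter_code and hash_window
- are hypothetical. SPELLF's real table is tuned for best fit and is
- not reproduced here.
-
```c
#include <ctype.h>
#include <string.h>

/* Letters in roughly descending order of English frequency
   (illustrative assumption). */
static const char RANK[] = "etaoinshrdlcumwfgypbvkjxqz";

/* Four-bit code for a letter: its frequency rank modulo 16, so the 16
   commonest letters get unique codes and the 10 rarest share codes
   with common ones. Returns -1 for anything that is not an ASCII
   letter. */
int letter_code(int c)
{
    const char *p;

    if (c < 0 || c > 127 || !isalpha(c))
        return -1;
    p = strchr(RANK, tolower(c));
    return (int)((p - RANK) & 15);
}

/* Pack the codes of the four characters at s into one 16-bit hash.
   Non-letters are treated as code 0 in this sketch. */
unsigned hash_window(const char *s)
{
    unsigned h = 0;
    int i;

    for (i = 0; i < 4; i++) {
        int code = letter_code((unsigned char)s[i]);
        if (code < 0)
            code = 0;
        h = (h << 4) | (unsigned)code;
    }
    return h & 0xFFFFu;
}
```
-
- With this naive pairing, "g" shares code 0 with "e", so "gast"
- hashes to the same table entry as "east". Collisions of this kind
- are what a carefully fitted table tries to steer toward entries that
- are already known to be good.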
-
- Still, SPELLF is not perfect. There are times when a misspelled word
- is counted as correct. But this happens very infrequently. (If you
- want to get a feel for this, run the program in "test" mode and try to
- trick the dictionary with various misspellings of common words.)
- Please understand that we make no guarantee of SPELLF's spelling
- accuracy, nor do we contend that it is untrickable. SPELLF is
- imperfect -- like every spellchecker.
-
- What SPELLF lacks in precision, however, it more than makes up for in
- speed, ease of use, and compactness (to say nothing of its
- adaptability to batch file technique, redirectable output, etc.), and
- on balance, we feel the "filter" approach is every bit as useful in
- day-to-day usage as the orthodox big-dictionary brute-force approach.
- We'd love to know what YOU think; write to us at 578 Fairfield Ave.,
- Stamford, CT 06902. And enclose $25 for full Turbo C source code,
- ready to compile in the tiny model.
-
- * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
-
-
-
-
-